Selective Annotation of Sentence Parts: Identification of Relevant Sub-sentential Units
نویسندگان
چکیده
Many NLP tasks involve sentence-level annotation yet the relevant information is not encoded at sentence level but at some relevant parts of the sentence. Such tasks include but are not limited to: sentiment expression annotation, product feature annotation, and template annotation for Q&A systems. However, annotation of the full corpus sentence by sentence is resource intensive. In this paper, we propose an approach that iteratively extracts frequent parts of sentences for annotating, and compresses the set of sentences after each round of annotation. Our approach can also be used in preparing training sentences for binary classification (domain-related vs. noise, subjectivity vs. objectivity, etc.), assuming that sentence-type annotation can be predicted by annotation of the most relevant sub-sentences. Two experiments are performed to test our proposal and evaluated in terms of time saved and agreement of annotation.
منابع مشابه
An Efficient Framework to Extract Parallel Units from Comparable Data
Since the quality of statistical machine translation (SMT) is heavily dependent upon the size and quality of training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, the existing solutions are restricted to extract either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework to extract both ...
متن کاملNovel elicitation and annotation schemes for sentential and sub-sentential alignments of bitexts
Resources for evaluating sentence-level and word-level alignment algorithms are unsatisfactory. Regarding sentence alignments, the existing data is too scarce, especially when it comes to difficult bitexts, containing instances of non-literal translations. Regarding word-level alignments, most available hand-aligned data provide a complete annotation at the level of words that is difficult to e...
متن کاملSemi-Automatic Annotation of Intra-Sentential Discourse Relations in PDT
In the present paper, we describe in detail and evaluate the process of semi-automatic annotation of intra-sentential discourse relations in the Prague Dependency Treebank, which is a part of the project of otherwise mostly manual annotation of all (intraand inter-sentential) discourse relations with explicit connectives in the treebank. Our assumption that some syntactic features of a sentence...
متن کاملAn annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
متن کاملTowards a Closer Integration of Termbases, Translation Memories, and Parallel Corpora: -A Translation-Oriented View-
This paper takes a look at how the use of terminological information and bilingual corpora of previously translated texts can improve the performance of translation memories. The focus is on using terminology to support sub-sentential alignment. The author tries to show that the performance of translation memories will not benefit significantly from generalizing the units stored in the memory b...
متن کامل